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Abstract 

Background: RNA-binding proteins regulate a number of cellular processes, including synthesis, folding, 
translocation, assembly and clearance of RNAs. Recent studies have reported that an unexpectedly large number of 
proteins are able to interact with RNA, but the partners of many RNA-binding proteins are still uncharacterized. 

Results: We combined prediction of ribonucleoprotein interactions, based on catRAPID calculations, with analysis of 
protein and RNA expression profiles from human tissues. We found strong interaction propensities for both 
positively and negatively correlated expression patterns. Our integration of in silico and ex vivo data unraveled two 
major types of protein-RNA interactions, with positively correlated patterns related to cell cycle control and 
negatively correlated patterns related to survival, growth and differentiation. To facilitate the investigation of 
protein-RNA interactions and expression networks, we developed the catRAPID express web server. 

Conclusions: Our analysis sheds light on the role of RNA-binding proteins in regulating proliferation and 
differentiation processes, and we provide a data exploration tool to aid future experimental studies. 



Background 

With the advent of high-throughput proteomic and 
transcriptomic methods, genome- wide data are giving 
previously unprecedented views of entire collections of 
gene products and their regulation. Recently, approaches 
based on nucleotide-enhanced UV cross-linking and 
oligo(dT) purification have shown that a number of pro- 
teins are able to bind to RNA [1,2]. 

RNA-binding proteins (RBPs) are key regulators of 
post-transcriptional events [3] and influence gene ex- 
pression by acting at various steps in RNA metabolism, 
including stabilization, processing, storing, transport and 
translation. RBP-mediated events have been described 
using recognition and regulatory elements in RNA 
sequences [4,5] as well as expression profiles [6] that 
are tissue specific and conserved across species [7-9]. 
Although heterogeneity in gene regulation is responsible 
for phenotypic variation and evolution [10], very little is 
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known about constitutive expression patterns controlled 
by RBPs [11,12], which are the subject of this work. 

Data from recent transcriptomic and proteomic 
studies [13,14] are becoming attractive for studying 
mechanisms of gene regulation [15,16]. Despite the 
increasing amount of genomic data, the development 
of computational methods for integrating, interpreting 
and understanding molecular networks remains challen- 
ging [17,18]. Here we combine our predictions of 
protein-RNA interactions, based on catRAPID calcula- 
tions [19,20], with the information obtained from 
expression data to investigate constitutive regulatory 
mechanisms. The catRAPID approach has been previously 
employed to predict protein associations with non-coding 
RNAs [21,22] as well as ribonucleoprotein interactions 
linked to neurodegenerative diseases [23,24]. Our theoret- 
ical framework has been used to unravel self-regulatory 
pathways controlling gene expression [25]. The catRAPID 
omics algorithm, validated using photoactivatable-ribonu- 
cleoside-enhanced cross-linking and immunoprecipitation 
(PAR-CLIP) data, has been recently developed to predict 
protein-RNA associations at the transcriptomic and 
proteomic levels [26]. 
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Using comprehensive and manually annotated databases 
of expression profiles in human tissues, at both protein 
and RNA levels, we investigated the correlation between 
RBP activity and regulation. The link between interaction 
propensity and expression levels was exploited to reveal 
the fine-tuned functional sub-networks responsible for 
regulatory control To explore the results further, we 
developed the catRAPID express web server [27] . 

Results 

In this study, we focused on the mRNA interactomes 
of RBPs detected through nucleotide- enhanced UV 
cross-linking and oligo(dT) purification approaches [1,2]. 
Exploiting gene ontology (GO) annotations [28] for 
protein-coding genes, we systematically analyzed protein- 
RNA interactions and expression data for human tissues. 

At present, few studies have investigated how altering 
protein expression affects the abundance of RNA targets. 
Interrogating the Gene Expression Omnibus (GEO) [29] 
and ArrayExpress databases [30], we found two human 
proteins, ELAV-like protein 1 (or human antigen R, 
HuR) [31] and Protein lin-28 homolog B (LIN28B) 
[32,33], whose knock-down has been shown to alter 
the expression of target genes identified by PAR-CLIP 
(see Materials and methods). 

Our predictions, made using the catRAPID algorithm 
[26], identified experimentally validated interactions with 
high significance (HuR: P = 10~ 8 ; LIN28B: P = 10~ 3 ; Fish- 
er s exact test; see Materials and methods). The interac- 
tions were effectively discriminated from non-interacting 
pairs using score distributions (LIN28B: P = 10" 4 ; HuR: 
P = 10" 16 ; Students t-test; see Materials and methods). 
Hence, catRAPID is very good at predicting physical in- 
teractions between a protein and RNA partners (other 



statistical tests are given in Materials and methods and 
Additional file 1). 

To understand the regulation of HuR and LIN28B 
targets better, we studied the relation between inter- 
action propensities and expression levels. We found 
that the expression of predicted HuR targets is altered 
(log-fold change, LFC) when HuR is knocked down 
(P < 10" 5 ; Kolmogorov-Smirnov test; Figure 1A), which 
is in agreement with experimental data [31]. Similarly, 
predicted LIN28B targets are downregulated upon 
protein depletion (P < 10" 2 ; Kolmogorov-Smirnov test; 
Figure IB), as shown in a previous study [33]. Moreover, 
we compared the top 1% of predicted associations with 
the top 1% of experimental interactions and found the 
same enrichments for transcripts changing in expression 
levels upon protein depletion. Specifically, 62% of HuR 
experimental interactions and 63% of HuR predicted 
associations had LFC > 0. Similarly for LIN28B, 57% of 
experimental interactions and 56% of predicted associa- 
tions had LFC > 0. 

These HuR and LIN28B examples indicate that 
changes in protein expression influence the abundance 
of RNA targets, suggesting that a large-scale analysis 
of co-expression and interaction propensities could 
improve understanding of RBP-mediated regulatory 
mechanisms. 

RNA-binding protein-mRNA interactions and relative 
expression profiles 

Our predictions indicate that interacting molecules have 
both more correlated and anti-correlated expression 
patterns (see Materials and methods and Figure 2). By 
contrast, non-correlated expression is not associated 
with any enrichment in interaction propensity (Additional 



Function ^ 
o 

bo _l 




B 






- I 1 I 1 I ~ 
— non-interacting 


Function 

o 

bo _i 


- 1 1 1 1 \_ - 




— interacting J 






ibution 
o 




ibution 

p 






CO 




CO 






b 0.4 




Q 0.4 






> 




> 






c5 

3 0.2 




CO 

3 0.2 






E 




E 






o 

0 


J 1 , l _ 


O 

0 


^1 1 , 1 _ 




-0.5 0 0.5 
LFC log(HUR expression/knockdown) 


-0.5 0 0.5 
LFC log(LIN28B expression/knockdown) 




Figure 1 Relation between protein and RNA regulation. (A) HuR interactome: our predictions, made using catRAPID [26], indicate that 


expression levels of RNA targets change upon HuR knock-down (log-fold changes, LFC), in agreement with experimental evidence [31] (P< 10~ 5 ; 


Kolmogorov-Smirnov test). (B) LIN28B interactome: RNA targets are downregulated upon LIN28B knock-down (LFC), as reported in a previous 


study [33] (P< 10" 2 ; Kolmogorov-Smirnov test). In this analysis, the prediction of the interactions was highly significant (HuR: P< 10" 8 ; LIN28B: 

P< 10~ 3 ; Fisher's exact test). Our results indicate that changes in protein expression influence the abundance of RNA targets to a significant extent. 


HuR, human antigen R; LFC, log-fold change; LIN28B, lin-28 homolog B. 
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Figure 2 Protein-RNA interaction and expression. (A) In this analysis, we compared interacting and non-interacting protein-RNA pairs at 
different interaction propensity scores. Areas under the curve (AUCs), expressed as percentages, were used to select the same number of interacting 
and non-interacting protein-RNA pairs. (B) The same procedure was used to investigate positively and negatively correlated protein-RNA 
expression at different thresholds. (C) With respect to non-interacting protein-RNA pairs, the predicted associations had enriched positively 
correlated expression (that is, co-expression; see Materials and methods). (D) Compared to non-interacting protein-RNA pairs, the predicted 
associations had enriched negatively correlated expression (that is, anti-expression; see Materials and methods). Non-correlated protein-RNA 
expression did not show any similar trend (Additional file 1). AUC, area under the curve. 

V J 



file 2: Figure S1A). We observed the same results using 
immunohistochemistry [34] and RNA sequencing data [6] 
to estimate protein abundances (Additional file 2: Figures 
SIB and S2; see Materials and methods). This finding is 
truly remarkable. Direct proportionality between protein 
and mRNA expression levels has been observed in bac- 
teria and fungi [13,14] but post-transcriptional modifica- 
tion is known to influence the overall abundance of the 
protein product in higher eukaryotes [35]. Since immuno- 
histochemistry only provides a qualitative estimate of the 
amount of protein (see Materials and methods) and the 
analysis is restricted to 612 proteins, we used RNA se- 
quencing for our predictions (1,156 RBPs). 

The enrichment shown in Figure 2 suggests that a 
good relation exists between interaction and expression 
of protein-RNA molecules, which should have co- 



evolved to be either co-expressed or anti-expressed to 
exert a regulatory function (Figure 2C,D). 

Conservation of expression pattern for functionally 
related genes 

We classified protein-RNA associations into four categor- 
ies: interacting and co-expressed (IC), interacting and 
anti-expressed (IA), non-interacting and co-expressed 
(NIC) and non-interacting and anti-expressed (NIA). We 
applied conditional tests on each subset to detect signifi- 
cantly over-represented gene ontology (GO) terms (see 
Materials and methods and Additional file 3: Table SI). 

For high interaction propensities, transcripts in the IC 
subset have more processes associated with cell cycle 
control, in particular the negative regulation of prolifera- 
tion (Discussion; Additional file 3: Table SI). 
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Transcripts interacting with anti-expressed proteins 
(IA subset) are involved in survival, growth and differen- 
tiation processes and have more regulative functions at 
the DNA level (Discussion; Additional file 3: Table SI). 

No clear functional assignments and/or insufficiently 
populated GO terms were found for transcripts in non- 
interacting protein-RNA pairs (NIC and NIA subsets). 

Intrinsic disorder and RNA-binding protein interaction 
propensity 

Recent findings suggest that RBPs have more structurally 
disordered regions [1]. To investigate the relation be- 
tween disorder and RNA-binding ability, we used the 
IUPred algorithm [36]. For each protein, we extracted 
structurally disordered regions (IUPred score > 0.4 [1]) 
and calculated the interaction propensities with human 
transcripts. We considered both canonical RBPs (that is, 
containing RNA-binding domains) and putative RBPs 
(that is, lacking RNA-binding domains) [1]. With respect 
to the RNA-binding ability of full-length sequences, the 
contribution of disorder is higher at low interaction pro- 
pensity scores and becomes negligible at high interaction 
propensities (see Materials and methods and Figure 3A). 
Nevertheless, the role of structural disorder is more 
pronounced in proteins lacking canonical RNA-binding 



domains, indicating that unfolded regions might be able 
to promote interactions with RNA (Figure 3B). 

In a previous study we observed that catRAPID scores 
correlate with chemical affinities [21], which suggests 
that the interaction propensity can be used to estimate 
the strength of association [21,26]. Hence, our results in- 
dicate that structural disorder might contribute to low- 
affinity interactions with RNA (Figure 3A,B), which is in 
agreement with what has been observed for protein- 
protein associations [37,38]. As a matter of fact, it has 
been reported that disorder regions are able to promote 
promiscuous and non-specific interactions [39]. 

Discussion 

Because they are associated with transcriptional control 
of gene expression, RBPs play fundamental roles in 
health and disease. Indeed, by binding to their target 
mRNAs, RBPs can influence protein production at dif- 
ferent levels (transcription, translation and protein/ 
mRNA degradation). Protein-RNA complexes are very 
dynamic and can undergo extensive remodeling. Thus, 
they can control the spatiotemporal regulation of target 
gene expression and the overall switching on and off oi 
the distinct sets of genes involved in biological processes 
such as cell cycle progression, cell differentiation, cell 




Interaction Propensity 

(Full-length sequence) Interaction Propensity 



Figure 3 RNA-binding ability and structural disorder. (A) For each protein, we calculated RNA interactions with full-length sequences as 
well as structurally disordered regions [1,36]. When the interaction propensity score of a disordered region exceeds that of the full-length protein 
(points above the red line), disorder is considered to promote interaction with RNA molecules. (B) For 66% of the proteins (137 entries), disorder 
contributes at low interaction propensities, while full-length protein sequences dominate at high interaction propensities (Mann-Whitney U test). 
Overall, from low to high interaction propensities, the contribution of disorder decreases progressively with respect to that of the full-length 
protein (red and grey lines), in agreement with a previous analysis [25]. The role of disorder is more relevant in proteins lacking canonical 
RNA-binding domains (grey line), indicating that unstructured regions might have direct involvement in contacting RNA. Interaction propensities 
are averaged per protein. RBD, RNA-binding domain. 
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response to metabolic stimuli and stress conditions, 
organ morphogenesis and embryonic development. 

Co-expression and interaction propensity are features of 
cell cycle control 

At high interaction propensities (AUC > 95%; see Materials 
and methods), the IC subset has more GO terms linked to 
cell cycle control and housekeeping functions such as 
nucleobase metabolism and purine biosynthesis (Figure 4 
and Additional file 3: Table SI). In particular, mRNAs 
interacting with co-expressed RBPs code for negative regu- 
lators of cell proliferation and migration (translation, 
signaling and metabolite utilization). We found a number 
of tumor suppressors in the IC subset (AHRR, BAX, 
BRMS1, CDKN1A, CDKN2A, CTBP1, DAB2IP, DKK3, 
FLCN, FOXP1, GADD45G, GALR1, GTPBP4, HIC1, 
IGFBP3, IRF8, KLF4, MEN1, MLH1, NF2, NR0B2, PARK2, 
PAWR, PAX4, PAX5, PCGF2, PHB, PML, PPP1R1B, 
PPP2R4, PTPRJ, PYCARD, RHOA, SIRT2, TFAP2A, 
TNFAIP3, TRIM24, TSC2, TSG101, UCHL1). Interestingly, 
90% of IC genes annotated with more functional categories 
(381 out of 422) are listed in the gene index of the National 
Institutes of Healths Cancer Genome Anatomy Project 
[40]. Terms associated with inhibition of cellular pathways 
(especially the negative regulation of phosphorylation and 
regulation of protein serine/threonine kinase activity) are 
also more prevalent in the IC subset when immunochemis- 
try data are used. 

As mutations altering tumor suppression lead to aber- 
rant proliferative events, we speculate that downregula- 
tion of specific genes is a mechanism for preventing 
indiscriminate cellular growth. In agreement with this 
hypothesis, it has been reported that somatic loss of 
function of the tumor suppressor tuberous sclerosis 2 



(TSC-2) leads to the development of benign and malig- 
nant lesions in the myometrium, kidney and other tis- 
sues sharing common features such as a low rate of 
renewal and defects in the mitochondrial respiratory 
chain associated with oncogenesis [41,42]. This gene is 
annotated in all the functional categories prevalent in 
the IC subset. Intriguingly, it is predicted that TSC-2 
mRNA interacts strongly with Nuclear Protein 5A 
(NOP56). The interaction propensity is 175 correspond- 
ing to an AUC of 99.5%. This protein is an essential com- 
ponent of the splicing machinery [43] that is differentially 
expressed in leiomyoma and downregulated in response 
to hypoxia [44]. It is possible that hypoxia-dependent re- 
pression of NOP56 expression [45-47] is a protective 
mechanism against fast growth and potential tumor pro- 
gression. Indeed, it has been reported that NOPS6 and 
TSC-2 are not differentially expressed in renal carcinomas 
and oncocytomas [48,49] (ArrayExpress: E-GEOD-12090; 
ArrayExpress: E-GEOD- 19982), indicating loss of regula- 
tion during malignant progression. 

Based on these observations, we propose that down- 
regulation of RBPs promoting the translation of dysfunc- 
tional tumor suppressors can prevent indiscriminate 
cellular growth and that loss of control can destine a 
cell to malignancy (additional examples are reported in 
Additional file 1). 

Anti-expression and interaction propensity are features of 
repressing processes 

For AUC > 95%, the IA subset has more terms associated 
with cell differentiation processes (for example, prox- 
imal/distal pattern formation) as well as inflammation 
(for example, positive regulation of isotype switching), 
which are known to be tightly linked [50-52]. In fact, a 
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Figure 4 GO enrichment for interacting mRNA-RBP pairs correlated in expression (IC subset). Using the catRAPID score distribution, we 
counted mRNA GO enrichment associated with different areas under the curve (see Materials and methods). The color gradient (yellow to red) 
indicates the AUC values (number of interactions: 20,702,804 for AUC > 50%, 10,351,402 for AUC > 75%, 2,070,280 for AUC > 95%). We found that 
cell cycle processes have more highly interacting mRNA-RBP pairs (AUC > 95%) that are correlated in expression. AUC, area under the curve; GO, 
gene ontology; IC, interacting and co-expressed; RBP, RNA-binding protein. 
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number of differentiation cytokines (IL18, IL23 and 
EBI3/IL27) and stimulators of cytokine production 
(CD28 and CD80CCR2/CD192) are in the subset More- 
over, a large fraction of entries is also linked to protein- 
DNA complex assembly and regulation of transcription 
initiation from RNA polymerase II promoter (Figure 5 
and Additional file 3: Table SI). It has been shown that 
94% of genes in I A enriched functional categories (124 
out of 132) are listed in the annotated gene index of the 
National Institutes of Healths Cancer Genome Anatomy 
Project [40]. Remarkably, terms clearly associated with 
cell differentiation and inflammation (especially regula- 
tion of embryonic development and B cell activation in- 
volved in immune response) are more prevalent in the 
IA subset when immunochemistry data are used. 

IA genes share the common functional property of 
regulating survival, growth and differentiation processes. 
As RBPs play a crucial role in repressing gene expression 
[53,54], IA associations could be involved in the regula- 
tion of proliferative events. Indeed, adult tissues are 
constantly maintained at the steady state [13] but a dra- 
matic reawakening of growth, survival and differenti- 
ation genes occur in either physiological conditions 
(for example, wound healing [50]) or pathological pro- 
gression to cancer [55]. 

In the IA set, we found YTHDC1 (YT521-B), which is 
a ubiquitously expressed member of the novel RNA- 
binding YTH-domain family [56]. YTHDC1 represses 
gene expression by either sequestering splicing factors 
or directly binding to transcripts [57-59] (Additional file 
2: Figure S5A). Among the transcripts that we predict to 
be potentially targeted by YTHDC1, we found several 
proto-oncogenes or tumor-associated genes such as 
RET, PRMT2, RARG and HOXA9 (RET: interaction 



propensity = 166; PRMT2: interaction propensity = 209; 
RARG: interaction propensity = 194; HOXA9: interaction 
propensity = 165; all corresponding to an AUC of 99.5%). 
In particular, alternatively spliced variants of PRMT2 
were related to survival and the invasiveness of breast 
cancer cells [60,61], while high expression of RARG and 
HOXA9 has been observed in human hepatocellular 
carcinomas and acute leukemia [62,63]. We hypothesize 
that perturbation of the regulation by YTHDC1 of 
potentially oncogenic genes such as RET, PRMT2, RARG 
and HOXA9 could be involved in the pathogenesis of 
related tumors. In fact, experimental studies support 
the implications for YTHDC1 in cancer progression 
with regard to angiogenesis, growth factor signaling, 
immortalization, genetic instability, tissue invasion and 
apoptosis [59,64,65]. 

Similarly, the translational silencer TIA-1, also re- 
ported to induce mRNA decay [66-68], is predicted to 
interact with the ubiquitously expressed NAP1L1 tran- 
script (interaction propensity = 113 corresponding to an 
AUC of 95%), consistent with iCLIP data for HeLa cells 
(ArrayExpress: E-MTAB-432) [69] (Additional file 4: 
Table S2). Deregulation of NAP 1 LI expression has been 
documented for several tumors such as small intestine 
carcinoid neoplasia [70], neuroendocrine tumors [71], 
ovarian cancer [72] and hepatoblastomas [73]. We 
hypothesize that TIA-1 plays a fundamental role in the 
post-transcriptional regulation of NAP1L1 and that alter- 
ation of this regulatory process contributes to NAP1L1- 
associated tumor development. 

We note that repression of aberrant interactions can 
be achieved by gene silencing, which prevents the poten- 
tial stabilizing action of RBPs on specific transcripts 
(Additional file 2: Figure S5B). For instance, the Nodal 



I 8 

o £ 



50% 75% 95% 




survival, growth and differentiation 



CHROMATIN STATE AND DNA BINDING 

protein-DNA complex assembly 
DNA recombination 

DEVELOPMENT AND DIFFERENTIATION 

proximal/distal pattern formation 
programmed cell death 
neurotransmitters and neuromodulators 
inflammatory signaling 



— 



100 



200 



300 



400 



500 



catRAPID score 

Figure 5 GO enrichment for interacting mRNA-RBP pairs anti-correlated in expression (IA subset). Using the catRAPID score distribution, 
we evaluated mRNA GO enrichment associated with different areas under the curve (see Materials and methods). A color gradient (cyan to blue) 
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gene is normally silenced in adult tissues and its expres- 
sion is associated with tumor progression [74]. Since 
Nodal is a member of the Transforming Growth Factor 
p (TGFB) superfamily and controls mesoderm formation 
and axial patterning during embryonic development 
[74], it is possible that Nodal interactions with specific 
RBPs lead to pathogenesis in adult tissues. Our predic- 
tions indicate that the transcript Nodal interacts 
with a number of anti-expressed RBPs (ADD1, API5, 
ARCN1, CANX, CAPRIN1, CCT6A, DKFZP434I0812, 
GSPT1, HSP90AB1, PKM, PUF60, XRCC5, YTHDC1 
and YWHAZ). Since the exact mechanism regulating 
Nodal is at present unknown, we generated a list of pro- 
tein partners that could be exploited for future experi- 
mental studies (Additional file 5: Table S3). 

Conclusions 

Comparative expression studies provide important in- 
sights into biological processes and can lead to the discov- 
ery of unknown regulation patterns. While evolutionary 
constraints on tissue-specific gene expression patterns 
have been extensively investigated [7-9,75,76], the consti- 
tutive regulation of RBP-mediated interactions is still 
poorly understood [11,12]. It has been previously observed 
that cellular localization and gene expression levels impose 
stringent conditions on the physicochemical properties of 
both protein and RNA sequences [77,78], but large-scale 
computational analyses of constitutive RBP-mediated 
regulatory networks have never been attempted before. 
Our study shows for the first time that the integration of 
in silico predictions [19] with ex vivo expression profile 
data [6,34] can be used to discover distinct features of 
RBP biological functions. 

We observed an enrichment of unique and function- 
ally related GO terms for RBP-mRNA pairs associated 
with high interaction propensities and specific expres- 
sion patterns. In our analysis, co-expression of interact- 
ing mRNA-RBP pairs (IC set) is linked to regulation of 
proliferation and cell cycle control, while anti-expression 
(IA set) is a characteristic feature of survival, growth and 
differentiation-specific processes. We do not exclude 
that RBP-mRNA associations displaying poor inter- 
action propensities (NIC and NIA sets) might have im- 
portant evolutionary implications as spatiotemporal 
separation and limited chemical reactivity could be ways 
to avoid aberrant associations [55]. 

We found that RNA-binding proteins are enriched in 
structurally disordered regions and that unfolded poly- 
peptide fragments promote association with RNA mole- 
cules at low interaction propensities. As disordered 
proteins are highly reactive [37], it is reasonable to as- 
sume that interaction with RNA needs to be tightly reg- 
ulated to avoid cellular damage [39]. In this regard, our 
results expand at the nucleic acid level what has been 



previously observed for the general promiscuity of na- 
tively unfolded proteins [38,79]. 

In conclusion, we hope that our study of protein- RNA 
interaction and expression will be useful in the design of 
new experiments and for further characterizing ribonu- 
cleoprotein associations. A list of proposed interactions 
and a server for new inquiries are available at the catRA- 
PID express webpage [27] . 

Materials and methods 

Prediction for LIN28B and HuR interactions 

We performed a number of tests to assess the quality of 
our calculations (see section on RNA-binding protein- 
mRNA interaction propensity) using PAR-CLIP data 
[31,33]. In this analysis, we used all the RNA interac- 
tions present in our dataset (positive set: 285 sequences 
for LIN28B and 579 for HuR) and, due to the unavail- 
ability of non-bound RNAs, the full list of human tran- 
scripts (negative set: 105,000 sequences). 

For the significance of interaction predictions, we per- 
formed Fishers exact test comparing the top 1% of pre- 
dicted interactions with the remaining protein-RNA 
associations (HuR: P=10~ 8 ; LIN28B: P=10" 3 ). Fishers 
exact test was computed using equal amounts (that is, 
1% of the total interactions) of randomly extracted 
negative subsets (HuR: P=10~ 7 ; LIN28B: P = 0.0002; 
Additional file 2: Figure S3). 

For the significance of score distributions, we used 
Students £-test to compare the score distribution of pos- 
itives and negatives (HuR: P=10~ 16 ; LIN28B: P=10~ 4 ). 
We also performed Students £-test using random extrac- 
tions of negative subsets, each containing the same 
number of RNAs as positives (LIN28B: P = 0.03; HuR: 
P< 10" 8 ; Students £-test). 

Other statistical tests (receiver operating characteristics 
and precision/recall curves) are discussed in Additional 
file 1. The expression data for HuR and LIN28B were 
taken from the original manuscripts [31,33] and processed 
as indicated by the authors. The datasets were down- 
loaded from GEO [29] (GSE29943) and ArrayExpress [80] 
(E-GEOD-44615 and E-GEOD-44613). 

mRNA dataset: Human BodyMap 

The Human BodyMap (HBM) 2.0 contains expression 
data generated using the Hiseq 2000 system and it has 
expression profiles for a number of human tissues [22]. 
The HBM RNA sequencing (RNA-seq) data was down- 
loaded from ArrayExpress [81] under accession number 
E-MTAB-513. The final mRNA dataset contained 35,818 
transcripts (11,584 genes) with expression levels for 14 
human tissues (see section on RNA-binding protein- 
mRNA expression). We considered all human cDNAs 
from EnsEMBL release 68. Transcripts incompatible 
with the catRAPID size restrictions (that is, 50 to 1,200 
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nucleotides) or not expressed in at least one tissue were 
filtered out. In the analysis, we evaluated different CD- 
HIT [82] sequence similarity cutoff thresholds (see section 
on Gene ontology analysis). 

RNA-binding protein dataset: Human Protein Atlas 

We considered all the RBPs reported in two studies on 
RBPs binding to mRNAs [1,2]. The initial dataset con- 
sisted of 3,500 RBPs (832 genes). Proteins incompatible 
with catRAPIDs size restrictions (that is, 50 to 750 
amino acids) and above a CD-HIT [82] sequence simi- 
larity cutoff of 75% were filtered out. Similarly, proteins 
not present in the Human Protein Atlas (HPA) database 
(version 11.0) [34] and not expressed in at least one tis- 
sue were discarded. The final RBP (HPA) dataset con- 
tained 612 proteins (491 genes) with expression levels 
for 14 human tissues (see section on RNA-binding pro- 
tein-mRNA expression). All protein sequences were re- 
trieved from EnsEMBL release 68. 



RNA-binding protein dataset: Human BodyMap 

As for RBPs in the HPA, filters on sequence size and 
redundancy were applied. Proteins not present in the 
Human BodyMap database (version 2.0) [6] were dis- 
carded. The final RBP (HBM) dataset contained 1,156 
proteins (543 genes) with expression levels for 14 human 
tissues (see section on RNA-binding protein-mRNA 
expression). All protein sequences were retrieved from 
EnsEMBL release 68. 



RNA-binding protein-mRNA expression 

We analyzed 14 human tissues for which both immuno- 
histochemistry [34] and transcript abundances [6] were 
available. At present, the Human Protein Atlas is the lar- 
gest collection of protein abundance data available [34]. 
Transcripts in the mRNA dataset and proteins in the 
RBP dataset were represented by vectors containing the 
normalized relative abundance of the following tissues: 
adrenal gland, brain, breast, colon, heart, kidney, liver, 
lung, lymph, muscle, lymph node, ovary, prostate and 
thyroid. For the immunohistochemistry data, the read- 
outs 'no', low', 'intermediate' or 'high' expression were 
transformed into numbers (0, 1, 2, 3) and subject to 
Z-normalization per tissue. As for the transcript data, 
the vectors were Z-normalized using the average and 
standard deviation per tissue. For each RBP-mRNA 
combination we computed the pairwise Pearson s correl- 
ation coefficient of the vectors. As shown in Additional 
file 2: Figures SI and S2, we observed the same trends 
using immunohistochemistry [34] and RNA-seq data [6] 
to estimate protein abundances in human tissues. 



RNA-binding protein-mRNA interaction propensity 

We used catRAPID [19,20] to compute the interaction 
propensity of each protein in the RBP dataset with each 
transcript in the mRNA dataset. catRAPID predicts pro- 
tein- RNA associations by estimating the interaction 
propensity between amino acids and nucleotides using 
secondary structure information, hydrogen bonding and 
Van der Waals forces [19,20]. The approach was previ- 
ously applied to predict associations between different 
types of proteins and RNA molecules [21,23]. Although 
each protein binds to distinct types of RNA structures 
[83], we observe that the contribution of hairpin loops 
accounts for 57% of the overall interaction propensity 
[19]. The catRAPID web server is publicly accessible 
from our webpage [84]. 

Protein-RNA interaction and expression 

For a given protein, interacting (n int ) and non-interacting 
i^no-int) protein-RNA pairs were compared at different 
AUCs (areas under the curve) of the interaction propen- 
sity distribution. The enrichment in positively correlated 
expression (Figure 2C) is calculated as: 

enrichment (co-expressed interactions) 

Hint 0 > rth) ^no-int 

Hno-int 0 > rth) 

In Equation (1), the correlation coefficient r follows 
the distribution of protein-RNA expression and the par- 
ameter rth > 0 corresponds to an AUC spanning the 
range 50% to 99.5% (Figure 2B). 

Similarly, for negatively correlated expressions (Figure 2D): 

enrichment (anti-expressed interactions) 

= n- mt (r < /th)-ftno-int(r < *th) ^ 

Hno-int( r < *th) 

In Equation (2), the parameter l th < 0 corresponds to 
an AUC spanning the range 50% to 99.5% (Figure 2B). 

Gene ontology analysis 

For each area under the curve (AUC) of the catRAPID 
score distribution (50% < AUC < 99.5%), we created four 
subsets according to the correlation in tissue expression: 
(1) IC subset: positively correlating and interacting genes 
(expression correlation > +0.7 and positive interaction pro- 
pensities); (2) IA subset: negatively correlating and inter- 
acting genes (expression correlation < -0.7 and positive 
interaction propensities); (3) NIC subset: positively correl- 
ating and non-interacting genes (expression correlation 
> + 0.7 and negative interaction propensities); (4) NIA 
subset: negatively correlating and non-interacting genes 
(expression correlation < -0.7 and negative interaction 
propensities). The expression correlation of |0.7| corre- 
sponds to AUC = 95% of the statistical distribution, for 
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which we found the highest enrichments (Figure 2C,D). 
We systematically applied conditional tests for GO term 
over-representation in each subset using the GOStats 
package (version 2.28.0) available from Bioconductor [85]. 
To assess the over-representation of a GO term in one 
particular subset at a certain AUC, we considered five cri- 
teria (Additional file 3: Table SI; Additional file 6: Table 
S4; Additional file 2: Figure S6): 

1. The GO term must be reported for more than two 
genes. 

2. The P value of the GO term must be significant 
(P < 0.05) in the subset of interest and 
non-significant (P> 0.1) in the others. 

3. The enrichment must be conserved with respect to: 
(a) the entire human transcriptome (that is, 
including RNAs longer than 1,200 nucleotides and 
independently of expression data), (b) the complete 
set of analyzed genes (that is, including RNAs 
shorter than 1,200 nucleotides and with available 
expression) and (c) all genes under the same AUC 
(that is, considering both interacting and non- 
interacting pairs at the two tails of the distribution). 

4. The P value of the GO term must be non-significant 
(P> 0.1) in: (a) the complete set of analyzed genes 
compared to the human transcriptome (significance 
would indicate enrichment irrespective of the subset 
assignment) and (b) the list of transcripts compatible 
with catRAPID length requirements compared to 
the human transcriptome (significance would 
indicate length bias in the statistics; see section 

on length bias statistics). 

5. The enrichment must be conserved after sequence 
redundancy reduction to the 80% identity threshold. 

Length bias statistics 

Due to the conformational space of nucleotide chains, 
prediction of RNA secondary structures is difficult when 
RNA sequences are > 1,200 nucleotides and simulations 
cannot be completed on standard processors (2.5 GHz; 4 
to 8 GB memory). To see whether GO enrichment is 
biased by the catRAPID length restriction, we used a 
hypergeometric test (see section on the RNA-binding pro- 
tein-mRNA interaction propensity). If a GO term is 
enriched in the length-restricted set, it is excluded a priori 
from the analysis because genes annotated in that GO 
term would be only selected for the length range. Thus, 
we imposed that GO terms must be non- significant 
(P > 0.1) in the length-restricted set of genes (see section 
on gene ontology analysis). This condition ensures that 
there is no bias due to length restrictions for any GO 
term enriched in a particular subset (Additional file 3: 
Table SI). 



Analysis of RNA-binding protein sequence disorder 

The content of disordered regions in the RBP sequences 
was computed using IUPred [36]. For each protein, we 
extracted structurally disordered regions (IUPred score 
higher than 0.4) and calculated their interactions against 
the reference transcriptome. We compared the inter- 
action propensities of each disordered region with that 
of the full-length protein and assessed if there was an in- 
crease or decrease of the interaction propensity score 
(Figure 3A). The contribution of the disordered region 
was evaluated using a Mann- Whitney U test, where a 
significant increase (P < 0.05; H 0 < in the interaction 
propensity score is associated with a positive contribu- 
tion. From low to high interaction propensities, the con- 
tribution of disorder decreases progressively with respect 
to that of the full-length proteins (Figure 3A). The role 
of disorder is more pronounced in proteins lacking 
canonical RNA-binding domains, indicating that un- 
structured regions have a direct involvement in contact- 
ing RNA (Figure 3B). 

Web server 

catRAPID express [27] is a publicly available implemen- 
tation of catRAPID [19,20], which is used to study the 
relation between protein-RNA interaction propensity 
and expression in Homo sapiens. The tool has two com- 
ponents: (1) catRAPID predictions of protein-RNA 
interaction and (2) the computation of correlation using 
protein and RNA expression profiles [6,34]. A descrip- 
tion of how catRAPID makes predictions can be found 
in the Documentation, Tutorial and Frequently Asked 
Questions (FAQs) on the webpage. Expression profiles of 
the RBP dataset and mRNA dataset are assigned respect- 
ively to input proteins and RNA using a homology- 
based criterion (ten top-ranked proteins with a BLAST 
[86] e < 0.01 and >75% whole sequence similarity; ten 
top-ranked transcripts with a BLAST e < 0.01 and >95% 
whole sequence similarity). Sequence similarity is evalu- 
ated using the Needleman-Wunsch algorithm [87]. 

Additional files 



Additional file 1: Additional materials and methods. 

Additional file 2: Figure SI. With respect to non-interacting protein-RNA 
pairs, non-correlated protein-RNA expression does not show enrichment 
using (A) RNA and (B) protein expression. Areas under the curve (AUCs) 
were used to select the same number of interacting/non-interacting and 
positively/negatively expressed protein-RNA pairs for the analysis. 
Figure S2. Protein-RNA interaction and expression (immunohistochemistry 
expression data). (A) With respect to non-interacting protein-RNA pairs, 
predicted associations had enriched positively correlated expression. 
(B) Compared to non-interacting protein-RNA pairs, predicted associations 
had enriched negatively correlated expression. Figure S3. P value 
distribution for HuR and LIN28B predictions. We compared P values 
(Fisher's exact test) for the catRAPID predictions for HuR and LIN28B 
RNA interactions (red arrow) using balanced bootstrap resampling 
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(random extractions of negative subsets with the same amount as the 
positive subset). The predicted interactions differ significantly from 
random associations. Figure S4. Receiver operating characteristic (ROC) 
and precision/recall (PR) curves for HuR and LIN28B predictions. We 
evaluated changes in the ROC and PR curves for the catRAPID predictions 
for the (A) HuR and (B) LIN28B RNA interactome for random samples using 
several ratios of positive and negative associations (pos/neg ratios). 
Figure S5. Examples of protein-RNA anti-expression scenarios. (A) We 
propose that YTHDC1 represses the expression of tumor-associated genes 
by destabilizing mRNAs. (B) Nodal expression in adult tissues is associated 
with tumor progression, which might be due to transcript stabilization. 
Figure S6. Nested representation of gene sets used in GO enrichment 
analysis. Figure S7. Changes in transcript and gene counts after sequence 
redundancy reduction. The mRNA database comprises 35,818 transcripts 
(1 1,584 genes). After redundancy filtering, the mRNA database is reduced to 
33,936 transcripts (1 1,483 genes) at 95% sequence identity threshold; 32,700 
transcripts (1 1,406 genes) at 90%; 31,287 transcripts (1 1360 genes) at 85% 
and 29,673 transcripts (1 1,317 genes) at 80%. 

Additional file 3: Table SI. mRNA GO-term enrichment analysis of 
interacting mRNA-RBP pairs (P values). Every GO term has been tested 
for over-representation for each subset (IC, IA, NIC and NIA) with respect 
to the human transcriptome, the complete set of analyzed genes 
(analyzed mRNA set) and the analyzed genes with the same AUC (relative 
AUC subset). Statistical control for the biased over-representation of 
GO terms in the complete set of analyzed genes and in a catRAPID 
length-restricted set of genes was used for the human transcriptome 
(see Materials and methods). Significant P values for the IC and IA subsets 
are shown in red. GOBP, gene ontology biological process; GOCC, gene 
ontology cellular component; GOMF, gene ontology molecular function; 
IA, interacting and anti-correlated in expression; IC, interacting and 
correlated in expression; NIC, not interacting and correlated in expression; 
Seq%id, sequence identity threshold used for redundancy reduction. 

Additional file 4: Table S2. NAP I LI read counts from TIA-1 iCLIP data. 
The count of reads mapping into the NAP1L1 gene and the relative 
cumulative distribution functions (cdf) are reported from the iCLIP 
experiment for controls and replicates of the TIA-1 protein [20]. The read 
count cdf was estimated after removal of genes with zero counts. 

Additional file 5: Table S3. Nodal anti-expressed interacting (IA) RBPs. 

Additional file 6: Table S4. mRNA GO-term enrichment analysis of 
interacting mRNA-RBP pairs (IC and IA gene counts are reported). Every 
GO term has been tested for over-representation in the IC and IA subsets 
with respect to the human transcriptome, the complete set of analyzed 
genes (analyzed mRNAs set) and the analyzed genes under the same 
AUC (relative AUC subset).GOBP, gene ontology biological process; GOCC, 
gene ontology cellular component; GOMF, gene ontology molecular 
function; IA, interacting and anti-correlated in expression; IC, interacting 
and correlated in expression; Seq%id, sequence identity threshold used 
for redundancy reduction. 
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